KEP: Coscheduling #2337

Closed
wants to merge 1 commit into from

Conversation

k82cn
Member

@k82cn k82cn commented Jul 3, 2018

Signed-off-by: Da K. Ma klaus1982.cn@gmail.com

/cc @vishh, @bsalamat, @mindprince, @jlewi

/cc @kubernetes/sig-scheduling-feature-requests

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 3, 2018
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 3, 2018

## References

* [Gang scheduling in Kuberentes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
Contributor

typo Kuberentes /Kubernetes

Member

@bsalamat bsalamat left a comment

Thanks, @k82cn and sorry for the delay in reviewing the PR.


## Motivation

After the discussion at the [Gang-scheduling](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv) proposal, we decided to make the Gang-scheduling API out of core by CRD, and implement it in [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator). kube-arbitrator focus on "batch" workload in kubernetes, and will share the same [scheduling frameworks](https://github.com/kubernetes/community/pull/2281) when it's ready. This document provides the definition of the CRD for gang-scheduling, and suggestions for customized job controllers.
Member

s/focus/focuses/

Contributor

I'd recommend focusing on one feature at a time. If that's gang scheduling, just talk about how k8s would enable it either via default scheduler or a new scheduler under incubation.


### CRD Definition

Although workload lifecycle requirements maybe different, e.g. MPI vs. Spark, the requirements to scheduler are similar. To meet both scheduling and lifecycle requirements, two **Kinds** are going to be defined. One new **Kind**, named `SchedulingSpec`, is defined as follow for scheduler:
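For orientation while reading the inline fragments below, here is a rough sketch of what such a `SchedulingSpec` **Kind** could look like; it is assembled from the field fragments quoted in this review (`Priority`, `NodeSelector`, `MinAvailable`) and is illustrative rather than the KEP's exact definition:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative sketch only: the field set is assembled from the fragments
// quoted in the review comments below, not the authoritative KEP text.
type SchedulingSpec struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec SchedulingSpecTemplate
}

type SchedulingSpecTemplate struct {
	// Priority of the 'Job'; resolved from PriorityClassName.
	// The higher the value, the higher the priority.
	// +optional
	Priority *int32

	// NodeSelector must match a node's labels for the pods of the 'Job'
	// to be scheduled on that node.
	// +optional
	NodeSelector map[string]string

	// MinAvailable is the minimal number of tasks that must be able to
	// run together for the 'Job' to be started.
	MinAvailable int32
}
```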
Member

s/maybe/may be/
s/kind/type/ ?

// this field from PriorityClassName.
// The higher the value, the higher the priority.
// +optional
Priority *int32
Member

I think you should remove this field from the template. Priority is supposed to be resolved at the time of Pod creation. It exists in PodSpec, but it doesn't make sense to be in the template. This field is not expected to be copied to PodSpec.

Member Author

Reusing Priority for the "Job" here; the priority of a Pod will only take effect within a "Job".

There are requirements for Job-level priority, IMO: kubernetes/kubernetes#65746

// fit on a node. Selector which must match a node's labels for the pod of
// 'Job' to be scheduled on that node.
// +optional
NodeSelector map[string]string
Member

NodeSelector is in a way an earlier and inferior version of NodeAffinity. Why not use NodeAffinity instead of NodeSelector?

Member Author

Agree, NodeAffinity is better; I used NodeSelector to keep the prototype simple :)

}
```

The `SchedulingSpec` defines necessary parameters of Job for scheduler, e.g. priority of Job, minimal available tasks of Job. `SchedulingSpec` does not include lifecycle parameters which are managed by following new **Kind**, named `QueueJob`:
Member

s/defines necessary parameters of Job for scheduler/specifies the scheduling constraints of a Job/
s/minimal/minimum/


// The minimal available pods to run for this QueueJob
// +optional
MinAvailable int32
Member

Given that this type shows status of a job, shouldn't this field be NumAvailable?


### kar-scheduler (scheduler component of kube-arbitrator)

The scheduler only watches `SchedulingSpec` and `Pod`; it'll reconstruct 'Job' by the `OwnerReference` of the controller (`OwnerReference.Controller == true`), and the `Pod`s are consider as 'Task' of 'Job'. That requires the controller to set the `OwnerReference` correctly. Different from other components, `kar-scheduler` does not use `LabelSelector` to reconstruct the job because of performance concerns.
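As an illustration of the reconstruction step, a rough sketch (not the actual kube-arbitrator code) of grouping pods into 'Jobs' by their controlling `OwnerReference`:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupPodsByController buckets pods by the UID of their controlling owner
// (the OwnerReference with Controller == true); each bucket corresponds to
// one reconstructed 'Job'. Pods without a controller are skipped.
func groupPodsByController(pods []*v1.Pod) map[types.UID][]*v1.Pod {
	jobs := map[types.UID][]*v1.Pod{}
	for _, pod := range pods {
		if owner := metav1.GetControllerOf(pod); owner != nil {
			jobs[owner.UID] = append(jobs[owner.UID], pod)
		}
	}
	return jobs
}
```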
Member

s/consider/considered/


2. `Number of Running pods` >= `.spec.SchedSpec.minAvailable`: If there are still enough running pods for the `QueueJob`, the `QueueJobController` will only recreate failed pods, and wait for scheduler to bind the pending pods.

* **QueueJob Update/Delete**: There is no rolling update for `QueueJob`; if the pod template is updated, the whole `QueueJob` is re-created. If a `QueueJob` is deleted, all its pods and its `SchedulingSpec` will be cleaned up.
Member

Do we plan to support running long-running services as a gang (QueueJob) as well? If we don't, we may not need to support rolling updates for batch jobs, which are expected to run to completion.

Member Author

Do we plan to support running long-running services as a gang (QueueJob) as well?

Good question! I originally planned to support long-running services too, but there is no concrete requirement for that right now. So I'd like to make it batch-only at first to keep the first release simple.

/cc @mindprince , fyi, some changes since we last synced up.


* **Pod Failures/Deletion/Unschedulable**: When a pod fails, is deleted, or is unschedulable, `QueueJobController` manages `QueueJob`'s lifecycle according to its status:

1. `Number of Running pods` < `.spec.SchedSpec.minAvailable` : If the number of running pods is less than `.spec.SchedSpec.minAvailable`, the `QueueJobController` will kill the whole QueueJob and recreate pods for rescheduling in batch.
Member

How do you deal with the last set of running jobs? For example, let's say that minAvailable is 20. Total number of tasks in the QueueJob is 100. 85 of them have already run and finished. 15 of them are still running. 15 is less than minAvailable, but we should let them run.

Member Author

Yes; it needs to include Succeeded pods: Succeeded + Running >= minAvailable
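A tiny sketch of the check being discussed (a hypothetical helper, counting `Succeeded` pods toward the gang):

```go
// shouldRestartJob sketches the rule discussed here: the whole QueueJob is
// torn down for batch rescheduling only when even Succeeded + Running pods
// cannot reach minAvailable. Hypothetical helper, not the controller's code.
func shouldRestartJob(succeeded, running, minAvailable int32) bool {
	return succeeded+running < minAvailable
}
```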

}
```

`QueueJob` is the default **Kind** for batch jobs; it includes multiple types of tasks according to the requirements of ML frameworks, e.g. TensorFlow. `QueueJob` also uses `SchedulingSpec` to describe its requirements to the scheduler. The `QueueJob` is managed by `QueueJobController` in `kar-controllers`; the detailed function is described in the following section.
Member

You haven't described users' workflow here, but I guess users only need to create a QueueJob and then the kar-controllers create Pods and SchedulingSpec. Is that right?
If so, why do we need SchedulingSpec? QueueJob has SchedulingSpecTemplate already and kar-scheduler can get MinAvailable from there. Other fields of SchedulingSpecTemplate will be available in the PodSpec of each Pod created for a QueueJob.

Member Author

You haven't described users' workflow here, but I guess users only need to create a QueueJob and then the kar-controllers create Pods and SchedulingSpec. Is that right?

That's right !

If so, why do we need SchedulingSpec?

SchedulingSpec is used by customized controllers, e.g. kubeflow/tf-operator. I do suggest using QueueJob, but I don't want to make it a restriction currently. For example, kubeflow/tf-operator can use this by replacing PDB with SchedulingSpec.

![queuejob](images/gang-scheduling-queuejob.png)


### Customized controller
Member

We also need to cover admission in this document. I think we ideally want to admit either the whole gang or none of the members of a gang. Doing so looks challenging to me with the existing admission controllers. Do you have any ideas on addressing this issue?

Contributor

Transactional multiobject admission is not possible.

Contributor

Would this mean that we have to necessarily have a wrapper (maybe QueueJob) that will allow for admission of a gang of pods in one shot?

Member

by the time scheduling happens, the pod objects have already been admitted, which means object counts, resource quotas, and pod spec restrictions have already been enforced... if the gang scheduler is just matching up pods to nodes, it doesn't seem like admission would cause a lot of problems

Would this mean that we have to necessarily have a wrapper (maybe QueueJob) that will allow for admission of a gang of pods in one shot?

no, you can't end-run around individual pod admission via another object

Member Author

I think we ideally want to admit either the whole gang or none of the members of a gang.

It's better to delegate this to controllers or applications; it's hard for admission to collect "dynamic info", e.g. for ResourceQuota it's hard for admission to decide which pods should be created to satisfy the 'Gang/PodSchedulingGroup' as much as possible. In the google doc, the QueueJob will stop creating new Pods if rejected (there are several corner cases to cover)

Member

In the google doc, the QueueJob will stop creating new Pods if rejected (there are several corner cases to cover)

The problem is that the user experience would be poor. A few pods of a gang are created, but the rest fail admission and are not created. Those pods that are created will not be scheduled, but will consume the user's quota. I guess the user has to delete them manually. If we cannot have transactional admission, at least we should have a controller clean up such pods.

Member Author

@k82cn k82cn Jul 22, 2018

A few pods of a gang are created, but the rest fail admission and are not created.

For batch or run-to-complete workloads, it's better to have some pending jobs there, so k8s can start the pending jobs immediately when some running jobs finish, instead of waiting for the client to submit new jobs; hmm...., a kind of pipeline. The difference is that
the "controller" has to make sure not to introduce deadlock (e.g. for every pod group, starting only part of it); only the "last" pod group should be impacted.

For example, in the google doc, the QueueJobController will create pods one by one based on the QueueJob's minAvailable; if minAvailable pods are not created, it will NOT create pods for the next QueueJob. If all minAvailable requirements are met, QueueJobController will create pods up to .spec.replicas in round-robin for each QueueJob. There are still some cases that are not covered, e.g. updating minAvailable for a better SLA, a new high-priority job, or starting other jobs if minAvailable cannot be met. That's another reason I prefer to focus on batch/run-to-complete first and introduce QueueJob; I'm also OK to enhance Job to support those cases in future, and maybe Queue.

Currently, AdmissionController checks pods based on a 'snapshot' of status/config, e.g. ResourceQuota. But for gang scheduling, it also needs to check whether the controller WILL create more pods later to meet the gang requirement. And a wrapper object cannot help as some AdmissionControllers are still doing pod-level checks.

Another corner case: a huge job (e.g. 1000+ tasks) waits for long-running run-to-complete jobs (e.g. hours or days) :(
The candidate solution is to support Queue, and balance resource sharing between Queues. This is also in the backlog.


## Function Detail

### CRD Definition
Member

I have a slightly different idea about the API. Can we create a namespaced object, let's call it a gang, that only specifies properties of a gang. Something like:

type Gang struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    NumMembers int
    Policy GangLifeCyclePolicy
}

Then Pod.PodSpec can have a field called GangName or GangId. When the field is not empty, it means that the pod is a gang member. When scheduling pods which are gang members, the scheduler makes sure that at least Gang.NumMembers of such pods can be scheduled simultaneously or none of them are scheduled.

The Gang type can also include gang life cycle policies. For example, the policy can specify that when the number of running pods drops below NumMembers, all the other instances are killed, or that the controller waits a certain amount of time for the gang members to reach NumMembers before killing the other instances, ....
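For illustration, a hypothetical sketch of what such a GangLifeCyclePolicy could look like; the proposal does not spell it out, so every field name here is invented:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical sketch only; not part of the proposal above.
type GangLifeCyclePolicy struct {
	// Action to take when the number of running members drops below
	// NumMembers, e.g. "KillAll" (tear down the remaining members) or
	// "None" (leave them running).
	Action string

	// GracePeriod is how long to wait for the gang to recover to
	// NumMembers before taking Action; nil/zero means act immediately.
	// +optional
	GracePeriod *metav1.Duration
}
```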

Here are some of the benefits of this design:

  1. Existing workloads that use one of the K8s collections, e.g. ReplicaSet, Job, etc., can keep their config and only add the GangName to their pod template.
  2. Flexibility. Pods of a gang can have different PodSpec and belong to different collections.
  3. Existing controllers which are responsible for managing life cycle of collections work well with this design.
  4. Users who have already learned and worked with K8s collections can keep using the familiar collections.

Contributor

This sounds like something that should be a label before we create a field on pod spec (also gang wouldn’t be a great name for an api object).

Contributor

We need a way to identify a gang, where labels could suffice in theory. We also need to know how many pods comprise a gang (or the minimum size of a gang). Encoding that value in a label isn't ideal.
I'd recommend modifying @bsalamat's proposal slightly by including a label selector to identify pods belonging to a gang.

+1 for aiming to have compatibility with existing controllers, Job API at minimum.

Member Author

@k82cn k82cn Jul 20, 2018

I used to propose PodSchedulingGroup in the google doc, so I totally agree with you on providing such a kind of API object in core.

Regarding label selector and 'GangID/GroupID', there are some cases we need to handle:

  • race condition: when a scheduler gets a Pod w/o 'Gang/PodSchedulingGroup', how should it schedule it?
    • If it schedules the pod and no 'Gang/PodSchedulingGroup' comes, that's OK
    • If it schedules the pod, and the 'Gang/PodSchedulingGroup' comes later and meets 'minAvailable', that's OK
    • If it schedules the pod, and the 'Gang/PodSchedulingGroup' comes later but with fewer pods than 'minAvailable':
      • if it's because of controller delay, the scheduler should wait
      • if there are no more pods, the scheduler should handle it by LifeCyclePolicy
    • If it does not schedule the pod, how long should we wait?
  • performance: for a batch or run-to-complete workload, pods come and go in a short time; it may take time to
    build the relationship between 'Gang/PodSchedulingGroup' and Pod
  • summary/status: if 'Gang/PodSchedulingGroup' supports crossing Kinds/Types, it's better to provide a way to
    show the summary/status, such as succeeded, running, pending

Here's what I did in kube-arbitrator:

  • race condition: as a batch scheduler, kube-arbitrator does scheduling based on 'job'; so it will not schedule pods
    until it gets the 'Gang/PodSchedulingGroup'
  • performance: using OwnerReference (controller==true); so 'GangID/GroupID' will be better for performance. 'label selector' is a
    kind of short-term solution to avoid touching more Kinds/Types
  • summary/status: it delegates to the 'Gang/PodSchedulingGroup' owner to handle summary/status

Another minor comment is the name: as you know, gang-scheduling is one of the requirements, and we may enhance this object for other requirements later; so a more
general name may be better, also for GangLifeCyclePolicy, e.g. changing it to PodGroupLifeCyclePolicy. And I suppose LifeCyclePolicy is going to be handled by kube-arbitrator.

Member

@bsalamat bsalamat Jul 20, 2018

I totally agree with @k82cn on label selector. I think a pod should identify itself as a gang member, not the label selector of a gang choosing pods. Klaus provided the reasons, so I don't repeat them again.
(Feel free to change the name of the object. I agree that Gang is not a good name for our API. I just used it to present the idea.)

Contributor

+1 to general point raised by @bsalamat here!

For naming, how about something simple like SchedulingGroup? Then the gang-ness can be specified in the policy.


```go
// QueueJob is a Kind for batch workload.
type QueueJob struct {
Contributor

If you were to consider the approach @bsalamat suggested, we would not need a queue job I think.

Member Author

Yes; if so, I'd like to enhance 'Job' instead of introducing a new Kind.

Member Author

For example, multiple pod templates (or a wrapper of multiple jobs).

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jul 26, 2018
@k82cn k82cn changed the title Design document of Gang scheduling KEP: Gang scheduling Jul 26, 2018
@k82cn
Member Author

k82cn commented Jul 26, 2018

@bsalamat @vishh , updated the doc according to comments, PTAL :)

  1. Introduce PodSchedulingGroup into core
  2. Added feature interactions section, e.g. Admission

@vishh
Contributor

vishh commented Jul 26, 2018 via email

@k82cn
Member Author

k82cn commented Jul 26, 2018 via email

@ConnorDoyle
Contributor

Is that meeting open to the public? (Asking for an invite)

// MinAvailable defines the minimal available tasks to run the Job;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinAvailable int
Member Author

s/MinAvailable/NumMember/g

Contributor

NumPods?


Minimum? It's likely rare to set less than Replicas, so this should be optional for simplicity.

Member

@beberg it may be needed in many cases for a Job or ReplicaSet, etc. to have more replicas than those needed to be run together.

Member Author

It's likely rare to set less than Replicas, so this should be optional for simplicity.

PodGroup is a common concept for the "batch" feature; other Kinds/types can also use it.

}

// PodSchedulingGroupTemplate represents the template of a pod group.
type PodSchedulingGroupTemplate struct {
Contributor

s/Template/Spec

Member Author

Done :)

| Unschedulable | Restart | 1.12 | The `PodSchedulingGroupController` deletes all `Running` Pods of `PodSchedulingGroup` with default grace period; it also records an event for `PodSchedulingGroup` when restarting. It's up to pod's controller to re-create new pods. |
| Unschedulable | None | 1.12 | The `PodSchedulingGroupController` does nothing |

There's a limitation: when setting the policy to `Action: Restart, Phase: Failed`, the `PodSchedulingGroup` may keep restarting in some cases, e.g. a bug in the application in the pod. The solution, e.g. `RetryCount`, will be proposed in a coming release.
@beberg beberg Aug 2, 2018

RetryCount is universal enough it should be specified now to prevent a proliferation of solutions. HPC workloads are prone to failures and checkpoint/restart mechanics are common. RestartPolicy Never leads to far too many failed jobs that could have been resumed automatically (overnight, weekends, ...), so PodSchedulingGroupTemplate.RetryCount is needed.

Member Author

We'll discuss this later; IMO, RetryCount is a common requirement, not only for batch.

// according to Pod's phase.
type LifeCyclePolicy struct {
// The action that will be taken to the PodSchedulingGroup according to
// Pod's phase. One of "Restart", "None".

Since the context and point of this API is gang scheduling, "Restart" is the more logical default here for both "PodFailed" and "Unschedulable". The none/none case can be done with existing APIs today. Would also like to know the use case in mind for restart/none and none/restart, as they seem to be non-gang scenarios covered by existing APIs.

Member Author

Would also like to know the use case in mind for restart/none and none/restart

What does "restart/none and none/restart" mean?


```go
// PodSchedulingGroup defines the scheduling requirement of a pod group
type PodSchedulingGroup struct {
Member

What do you think about renaming this to just PodGroup?

Member Author

+1

Member Author

Done

// MinAvailable defines the minimal available tasks to run the Job;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinAvailable int
Member

@beberg it may be needed in many cases for a Job or ReplicaSet, etc. to have more replicas than those needed to be run together.


// The phase of pod; the controller takes actions according to this
// pod's phase. One of "PodFailed", "Unschedulable".
Event Event
Member

I think we should also add a Timeout field here. When Timeout is specified, the controller waits for the Group members to satisfy the Group requirements. If the timeout expires, it takes the Action.

Contributor

What does PodFailed mean in this context? Should this field instead be a Phase that indicates the state of the gang/group?

Member Author

we should also add a Timeout field here.

+1, and Done

What does PodFailed mean in this context? Should this field instead be a Phase that indicates the state of the gang/group?

That means at least one Pod of the Group has failed; PodGroupController needs to take action accordingly. I did not introduce a Phase for PodGroup as only the Pod's "controller" knows the phase; for example, for a batch job there may be Failed, Done, Suspended, Terminated and other phases; PodGroupController does not know about those :(.
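Putting the quoted fragments and the Timeout suggestion above together, a sketch of what the resulting LifeCyclePolicy could look like; the Timeout field and its type are assumptions:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Action and Event are defined elsewhere in the KEP; shown as plain strings
// here so the sketch is self-contained.
type Action string
type Event string

// Sketch assembled from the fragments and comments in this thread;
// illustrative only.
type LifeCyclePolicy struct {
	// The action taken on the PodGroup when Event occurs.
	// One of "Restart", "None".
	Action Action

	// The pod phase that the controller reacts to.
	// One of "PodFailed", "Unschedulable".
	Event Event

	// Timeout is how long the controller waits for the group to satisfy
	// its requirements before taking Action. Assumption, not KEP text.
	// +optional
	Timeout *metav1.Duration
}
```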

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 3, 2018
@k82cn
Member Author

k82cn commented Oct 11, 2018

@bsalamat , @vishh do you have more comments ? :)

Member

@bsalamat bsalamat left a comment

Thanks, Klaus. I think this is a good starting point. I have several minor comments. Otherwise, looks good.


As the lifecycle of Pods in a PodGroup may differ from one controller to another, the lifecycle of the members is not managed by the coscheduling feature. Each collection controller may implement or already have the means to manage the lifecycle of its members. The scheduler will record related events for the controller to manage pods, e.g. `Unschedulable`.

The update to `PodGroup` is not support for now; and deleting `PodGroup` does not impact Pod's status.
Member

nit: s/support/supported/

As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
Member

please replace numMembers with minMembers here and below.


As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
Member

It is not clear what this item means. What do you mean by "reconstructing job"?

}
```

The `PodGroup`, which is a namespaced object, specifies the attributes and status of a pod group, e.g. number of pods in a group. To define which pods are member of `PodGroup`, the following annotation key is introduced for `Pod`; the annotation key is used for this alpha feature, and it'll be changed to a more permanent form, such a field, when moing `PodGroup` to core.
Member

s/moing/moving/

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
3. In `allocate` phase, scheduler will
* record an `Unschedulable` event of `PodGroup` if some pods are running but `succeeded + pending + running < numMembers`, the controller takes action according to its configuration
Member

Today, we do not delete completed pods as long as their parent object (such as Job, ReplicaSet, etc) is not deleted. This behavior is essential for the proposed algorithm to work correctly.

(This is a note to ourselves. Nothing needs to be changed in the design.)

Contributor

Is the minMembers field in PodGroup expected to track the number of non-terminated pods or all pods that have been successfully admitted?

Member

Vish, I think "succeeded" means those pods that have run and finished successfully. This makes sense for "run-to-completion" pods.


### Multi-scheduler

Since multiple schedulers work in parallel, there may be decision conflicts between different schedulers, and the kubelet will reject one pod (as failed) in case of conflict. The controller will handle rejected pods based on its lifecycle policy for failed pods. It up to cluster admin to avoid such kind of conflict, e.g. node affinity which is supported by kube-arbitrator in 1.13.
Member

I would change the last sentence to
"Users and cluster admins may reduce the probability of such conflicts by partitioning the clusters logically, for example, by placing node-affinity to distinct set of nodes on various groups of pods."

2. Pods of a group/gang may have different `PodSpec` (and/or belong to different collections)
3. Existing controllers which are responsible for managing life cycle of collections work well with this feature

To meet the requirements above, the following **Kind** is introduced by CRD under `scheduling.incubator.k8s.io/v1alpha1` **Group**/**Version**.
Contributor

nit: Should it be incubator.scheduling.k8s.io? That way once it's out of incubation we only need to drop a sub-domain?

Member Author

Yours seems better :)


// The number of pods which reached phase Failed.
// +optional
Failed int32
Contributor

Why is tracking Succeeded and Failed useful at the pod Group level? This seems to make an assumption that pod groups are typically used only for run to completion jobs.

Member Author

It just gives a summary of pod status here; not only for run-to-completion jobs.

The `PodGroup`, which is a namespaced object, specifies the attributes and status of a pod group, e.g. number of pods in a group. To define which pods are member of `PodGroup`, the following annotation key is introduced for `Pod`; the annotation key is used for this alpha feature, and it'll be changed to a more permanent form, such a field, when moing `PodGroup` to core.

```go
scheduling.k8s.io/group-name
```
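For illustration, a minimal sketch (using client-go's core/v1 types) of how a controller might stamp this annotation onto every member pod; the helper name is hypothetical:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// tagPodWithGroup marks a pod as a member of the named PodGroup using the
// alpha annotation key above. Illustrative sketch only.
func tagPodWithGroup(pod *v1.Pod, groupName string) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["scheduling.k8s.io/group-name"] = groupName
}

// Usage (e.g. inside a controller building member pods):
//   tagPodWithGroup(pod, "qj-1")   // "qj-1" is just an example group name
```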
Contributor

nit: Mark it alpha explicitly & be consistent with the object kind - alpha.incubator.scheduling.k8s.io

Member Author

hm ...., I mentioned this alpha feature in the doc and will mention it in the release notes (in future); so when we migrate this feature to upstream, the user does not need to update the annotation in pods.


### Scheduling

The scheduler only watches `PodGroup` and `Pod`. It'll reconstruct 'Job' by the annotation of Pod and `PodGroup`; the `Pod`s are considered as 'Task's of the 'Job'; if annotation is empty, the scheduler records an unschedulable event of pod to ask user/controller to resubmit it. The scheduler does not schedule pods until its `PodGroup` is created.
Contributor

if annotation is empty, the scheduler records an unschedulable

So the batch scheduler that you intend to develop will not schedule pods that do not need gang scheduling?

Member Author

@k82cn k82cn Oct 12, 2018

Yes, for the first version. As we do not know when/whether the scheduler will get the PodGroup in future, the PodGroup is required currently, but the user can set minMember to 1.

I'm also thinking of adding another "flag" to identify whether a PodGroup is required to schedule the pod. Supporting jobs without a gang-scheduling PodGroup in kube-batch should be another proposal.

As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
Contributor

implementation note: Do not overwhelm the system with repeated unschedulable events.

Member Author

sure :)

* record an `Unschedulable` event of `PodGroup` if some pods are running but `succeeded + pending + running < numMembers`, the controller takes action according to its configuration
* allocate (but not bind) resource to Pods according to Pod's spec, e.g. `NodeAffinity`
* bind all Pods to hosts until the job is ready: if `numMembers` <= `allocated Pods` + `pending Pods`, it's ready when `numMembers` <= `allocated Pods`; otherwise, `numMembers` <= `allocated Pods` + `succeeded Pods`
4. If the scheduler cannot allocate enough resources to the job, the pods stay pending; and the resources cannot be allocated to other jobs
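The bind condition in step 3 above can be expressed as a small helper; a hypothetical sketch, not the kube-batch implementation:

```go
// jobReady sketches the readiness rule above: when allocated + pending pods
// can still reach numMembers, wait until numMembers pods are actually
// allocated; otherwise count pods that have already succeeded toward the gang.
func jobReady(numMembers, allocated, pending, succeeded int32) bool {
	if numMembers <= allocated+pending {
		return numMembers <= allocated
	}
	return numMembers <= allocated+succeeded
}
```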
Contributor

Two issues to consider:

  1. There is the risk of split brain where batch scheduler and the default scheduler might schedule onto the same resources which could in theory lead to one of the co-scheduled pods being rejected by the kubelet. What would you do if the batch scheduler cannot re-schedule that pod? Do we fall back on the controller to kill & resubmit the entire job? That feels like a sub-optimal solution from a user POV. Would it be possible to assign phantom pods to nodes to hold on to resources?
  2. Deadlock in the batch scheduler - If there are two jobs and both of them have half of their jobs allocated (not bound), then how will you make progress? Have you considered doing a transactional allocation - allocate all or free up allocated resources?

Member Author

Do we fall back on the controller to kill & resubmit the entire job? That feels like a sub-optimal solution from a user POV.

Yes, the controller will re-submit the entire job. Phantom pods may also be preempted by high-priority pods from the default scheduler. For such a mixed environment, I prefer to only share resources between elastic 'jobs', e.g. Spark's executors and nginx/tomcat.

Deadlock in the batch scheduler ... allocate all or free up allocated resources?

In kube-batch 0.1, we used this option; two major concerns: 1. there is extra computation here, pre-allocating and freeing up, 2. after freeing up, the resources may not be usable by others, e.g. because of predicates.
So in kube-batch 0.2/0.3, I'm thinking of using preemption (or backfill) to handle this case: kube-batch will try to allocate resources by each job's minMember as much as possible; if there are not enough resources, it first preempts allocated resources which are occupied by a pod group that cannot use them.

Member

So in kube-batch 0.2/0.3, I'm thinking of using preemption (or backfill) to handle this case: kube-batch will try to allocate resources by each job's minMember as much as possible; if there are not enough resources, it first preempts allocated resources which are occupied by a pod group that cannot use them.

What triggers this "preemption"? When there are two pod groups with the same priority and each one is partially allocated and there are no more resources in the cluster, which one preempts the other?

Member Author

Not sure which term is better: preempt or backfill; anyway, the assumed/allocated resources will be in the "Allocated" state if the job did not get enough resources; and after the allocation phase, those allocated-but-not-bound resources will be re-calculated for other jobs. For your case, the two jobs will try to get those allocated resources in order (e.g. FCFS).

Member

I am not sure if FCFS works in this case. What if the pods of the two jobs are created/processed alternately? For example, the batch scheduler processes pod 1 of job 1 and then pod 1 of job 2, then pod 2 of job 1, then pod 2 of job 2, and so on.

Member Author

@k82cn k82cn Oct 19, 2018

In kube-batch, allocation is done at the job/podgroup level (FCFS is also per job/podgroup); so for this case, we will not handle pod 2 of job 2 until all pods in job 1 are handled :)

* bind all Pods to hosts until the job is ready: if `numMembers` <= `allocated Pods` + `pending Pods`, it's ready when `numMembers` <= `allocated Pods`; otherwise, `numMembers` <= `allocated Pods` + `succeeded Pods`
4. If the scheduler cannot allocate enough resources to the job, the pods stay pending; and the resources cannot be allocated to other jobs

That may leave resources (less than the job's resource request) idle for a while, e.g. for a huge job. The solution, e.g. backfilling other smaller jobs to improve resource utilization, will be proposed in a coming release. In the `allocate` phase, only the pod's `NodeAffinity` takes effect; the other predicates/priorities will be included on-demand in a coming release.
Contributor

Backfilling sounds too optimistic; many k8s clusters may not have that many jobs to backfill.

Member Author

@k82cn k82cn Oct 16, 2018

it's only one of the solution options :)


### Admission Controller

If quota runs out in the middle of creating a group of pods, a few members of a `PodGroup` may be created, while the rest will be denied by the `ResourceQuota` admission controller. `.spec.TotalResource` is added in `PodGroup` to address this problem. When a `PodGroup` is created with `.spec.TotalResource`, that much quota is reserved for the group if there is quota available. Pods of the group use the already-reserved quota. By setting `.spec.TotalResource` properly, one can ensure that Pods of a `PodGroup` have enough quota at creation time.
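For concreteness, a sketch of one plausible shape for `.spec.TotalResource`; the exact definition is not spelled out here, so the use of core/v1 `ResourceList` is an assumption:

```go
package v1alpha1

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative only: one plausible shape for the quota-reservation field.
type PodGroupSpec struct {
	// MinMembers of the group (minMembers elsewhere in this KEP).
	MinMembers int32

	// TotalResource the whole group needs; this amount of quota would be
	// reserved up front if available.
	// +optional
	TotalResource v1.ResourceList
}

// Example: reserve quota for a group of 8 pods needing 2 CPU / 4Gi each.
var exampleSpec = PodGroupSpec{
	MinMembers: 8,
	TotalResource: v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("16"),
		v1.ResourceMemory: resource.MustParse("32Gi"),
	},
}
```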
Contributor

Do mention the changes required to the default quota controller to make this proposal work. Those will be changes to the core project and need to be acceptable to the other quota subsystem owners.

Contributor

@vishh vishh left a comment

Klaus, there are still some open questions & comments. Can you run through them once?

Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
@justaugustus
Member

/kind kep

@liggitt
Member

liggitt commented Oct 18, 2018

just to verify, there are no in-tree changes proposed now, correct? this was a question in sig-arch while going through the features targeted at 1.13

@k82cn
Member Author

k82cn commented Oct 18, 2018

just to verify, there are no in-tree changes proposed now, correct?

@liggitt , yes, we do NOT have code changes in-tree for this feature in 1.13.

@bsalamat
Member

bsalamat commented Nov 6, 2018

/assign

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k82cn
Member Author

k82cn commented Dec 3, 2018

Moved to kubernetes/enhancements#639

@k82cn k82cn deleted the kep_583 branch December 3, 2018 01:54